Humanity’s Last Exam (HLE)

The hardest AI benchmark ever built: 2,500 expert-level questions designed to be the final closed-ended academic exam for AI

Published: August 20, 2025

Keywords: Humanity’s Last Exam, HLE, AI benchmark, frontier LLM evaluation, CAIS, Scale AI, expert-level questions, calibration error, MMLU saturation, multi-modal benchmark, LLM leaderboard

Introduction

AI benchmarks are critical for measuring LLM progress — but most of them are already saturated. Frontier models now score over 90% on popular benchmarks like MMLU and GPQA, making them ineffective at distinguishing between state-of-the-art models.

Humanity’s Last Exam (HLE) was created to address this. It is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind. Where other benchmarks have become routine for frontier LLMs, HLE remains brutally difficult — with even the best models scoring well below 50%.

“HLE tests structured academic problems rather than open-ended research or creative problem-solving abilities, making it a focused measure of technical knowledge and reasoning. HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI.” — HLE Paper

graph LR
    A["Traditional Benchmarks<br/>(MMLU, GPQA, etc.)<br/>90%+ accuracy"] --> B["Benchmark<br/>Saturation"]
    B --> C["Humanity's Last Exam<br/>2,500 expert questions<br/>Best models < 45%"]
    C --> D["Meaningful signal<br/>for frontier AI"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is Humanity’s Last Exam?

Humanity’s Last Exam (HLE) is a multi-modal benchmark consisting of 2,500 questions across dozens of academic subjects — mathematics, humanities, natural sciences, and more. It was designed to test both:

  • Depth of reasoning — world-class mathematical and scientific problems
  • Breadth of knowledge — questions spanning over 100 subject areas

Key Characteristics

| Feature | Details |
| --- | --- |
| Total questions | 2,500 (public) + private held-out set |
| Subjects covered | 100+ across math, humanities, natural sciences |
| Question types | Multiple-choice (24%) and short-answer (76%) |
| Multi-modal | 14% of questions require understanding images/diagrams |
| Grading | Automated (closed-form, unambiguous answers) |
| Anti-contamination | Private test set to detect overfitting; canary strings |

What Makes It So Hard?

Every question in HLE was:

  1. Created by subject-matter experts — nearly 1,000 contributors across 500+ institutions in 50+ countries (professors, researchers, PhD holders)
  2. Required to stump frontier LLMs — a question only passed the initial bar if models could not answer it correctly
  3. Manually reviewed by expert reviewers with graduate degrees in relevant fields
  4. Verified unsearchable — questions that could be easily answered via web search were removed

The dataset started with over 70,000 submissions. Only 13,000 passed the LLM difficulty filter. After expert human review, a finalized set of 2,500 public questions remained.

graph TD
    A["70,000+ submissions<br/>from global experts"] --> B["13,000 passed<br/>LLM difficulty filter"]
    B --> C["Expert human review<br/>(graduate-level reviewers)"]
    C --> D["2,700 accepted"]
    D --> E["Remove searchable<br/>& flagged questions"]
    E --> F["2,500 finalized<br/>public questions"]

    style A fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#e67e22,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#8e44ad,color:#fff,stroke:#333
    style F fill:#27ae60,color:#fff,stroke:#333

Who Built It?

HLE was developed by the Center for AI Safety (CAIS) and Scale AI, with lead authors:

  • Long Phan, Nathaniel Li, Adam Khoja, Richard Ren — Center for AI Safety
  • Alice Gatti, Ziwen Han, Josephina Hu, Hugh Zhang — Scale AI
  • Summer Yue, Alexandr Wang — Scale AI (senior leads)
  • Dan Hendrycks — Center for AI Safety (senior lead)

Contributors competed for a $500,000 USD prize pool ($5,000 for each of the top 50 questions, $500 for the next 500 questions), along with optional co-authorship.

Publication

HLE was published in Nature (Nature 649, 1139–1146, January 2026), one of the most prestigious scientific journals, underscoring its significance to the research community.

| Resource | Link |
| --- | --- |
| Nature paper | nature.com/articles/s41586-025-09962-4 |
| arXiv preprint | arxiv.org/abs/2501.14249 |
| Website | lastexam.ai |
| GitHub | github.com/centerforaisafety/hle |

What Skills Does It Test?

Unlike narrowly focused benchmarks, HLE tests a broad spectrum of expert-level academic capabilities:

graph TD
    HLE["Humanity's Last Exam<br/>2,500 questions"] --> M["Mathematics<br/>& Logic"]
    HLE --> S["Natural Sciences<br/>Physics, Chemistry, Biology"]
    HLE --> H["Humanities<br/>History, Classics, Philosophy"]
    HLE --> CS["Computer Science<br/>& Engineering"]
    HLE --> Med["Medicine<br/>& Life Sciences"]
    HLE --> Other["Other Disciplines<br/>Economics, Law, Linguistics..."]

    style HLE fill:#e74c3c,color:#fff,stroke:#333
    style M fill:#3498db,color:#fff,stroke:#333
    style S fill:#27ae60,color:#fff,stroke:#333
    style H fill:#f39c12,color:#fff,stroke:#333
    style CS fill:#8e44ad,color:#fff,stroke:#333
    style Med fill:#e67e22,color:#fff,stroke:#333
    style Other fill:#6cc3d5,color:#fff,stroke:#333

| Capability | What HLE Tests |
| --- | --- |
| Deep reasoning | Multi-step mathematical proofs, complex derivations |
| Expert knowledge | Cutting-edge scientific facts, obscure domain knowledge |
| Multi-modal understanding | Questions with diagrams, inscriptions, chemical structures |
| Calibration | Whether models know what they don’t know (confidence estimation) |
| Resistance to search | Knowledge that cannot be trivially retrieved via internet search |

Example Questions

HLE questions span extraordinary breadth — from translating Palmyrene script on Roman tombstones (Classics) to identifying the number of paired tendons supported by a hummingbird’s sesamoid bone (Ecology/Anatomy). This diversity is what makes HLE uniquely challenging.

Current Leaderboard

The leaderboard below shows model accuracy on HLE as published on the SEAL LLM Leaderboard by Scale AI. Rankings use Rank (Upper Bound): a model’s rank is 1 plus the number of models whose lower confidence-interval bound exceeds its upper bound, so reported rank differences are statistically meaningful.

Source: SEAL LLM Leaderboard — Humanity’s Last Exam (consulted March 28, 2026). Dataset updated April 3, 2025, with finalized 2,500 questions. Judge model: o3-mini.

| Rank | Model | Accuracy (%) | Calibration Error |
| --- | --- | --- | --- |
| 1 | GPT-5.4 Pro | 44.32 ± 1.95 | 38 |
| 2 | Gemini 3 Pro Preview | 37.52 ± 1.90 | 57 |
| 2 | GPT-5.4 (xhigh thinking) | 36.24 ± 1.88 | 42 |
| 2 | Claude Opus 4.6 (thinking max) | 34.44 ± 1.86 | 46 |
| 4 | GPT-5 Pro | 31.64 ± 1.82 | 49 |
| 6 | GPT-5.2 | 27.80 ± 1.76 | 45 |
| 6 | GPT-5 | 25.32 ± 1.70 | 50 |
| 6 | Claude Opus 4.5 (thinking) | 25.20 ± 1.70 | 55 |
| 6 | Kimi K2.5 | 24.37 ± 1.81 | 67 |
| 7 | GPT-5.1 (thinking) | 23.68 ± 1.67 | 55 |
| 11 | Gemini 2.5 Pro (Jun 05) | 21.64 ± 1.61 | 72 |
| 11 | o3 (high) | 20.32 ± 1.58 | 34 |
| 11 | GPT-5 Mini | 19.44 ± 1.55 | 65 |
| 11 | o3 (medium) | 19.20 ± 1.54 | 39 |
| 11 | Claude Opus 4.6 (non-thinking) | 19.00 ± 1.54 | 44 |
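The Rank (Upper Bound) rule described above can be sketched in a few lines of Python. The scores below are the top five rows of the table; the exact tie-handling used by the SEAL leaderboard is an assumption here:

```python
# Rank (Upper Bound): 1 + number of models whose lower confidence-interval
# bound exceeds this model's upper bound. Scores are (accuracy, ±margin).
models = {
    "GPT-5.4 Pro": (44.32, 1.95),
    "Gemini 3 Pro Preview": (37.52, 1.90),
    "GPT-5.4 (xhigh thinking)": (36.24, 1.88),
    "Claude Opus 4.6 (thinking max)": (34.44, 1.86),
    "GPT-5 Pro": (31.64, 1.82),
}

def rank_upper_bound(scores: dict[str, tuple[float, float]]) -> dict[str, int]:
    """For each model, count models whose CI lies entirely above its CI."""
    ranks = {}
    for name, (acc, margin) in scores.items():
        upper = acc + margin
        strictly_better = sum(
            1 for other, (o_acc, o_margin) in scores.items()
            if other != name and o_acc - o_margin > upper
        )
        ranks[name] = 1 + strictly_better
    return ranks

print(rank_upper_bound(models))
```

Run on these five rows, the rule reproduces the table’s ranks 1, 2, 2, 2, 4: the three models ranked 2 have overlapping confidence intervals, so none can be declared statistically better than the others.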

Key takeaway: Even the best frontier model (GPT-5.4 Pro) scores only 44.32% — meaning more than half the questions remain unsolved. Most models exhibit high calibration errors, indicating systematic overconfidence.

For the full, up-to-date leaderboard, visit the links in the next section.

Where to Explore the Benchmark

Dashboards and Leaderboards

| Resource | Description | Link |
| --- | --- | --- |
| SEAL LLM Leaderboard | Scale AI’s official leaderboard with confidence intervals and calibration | labs.scale.com/leaderboard/humanitys_last_exam |
| CAIS AI Dashboard | Center for AI Safety’s dashboard with HLE-Rolling live submission | agi.safe.ai/dashboard |
| HLE Website | Official website with paper, results, and progress chart | lastexam.ai |

Dataset and Code

| Resource | Description | Link |
| --- | --- | --- |
| Hugging Face Dataset | The full 2,500-question dataset (requires access agreement) | huggingface.co/datasets/cais/hle |
| GitHub Repository | Evaluation code, prompts, and documentation | github.com/centerforaisafety/hle |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2501.14249 |
| Nature Publication | Peer-reviewed publication | nature.com/articles/s41586-025-09962-4 |

Load the Dataset

from datasets import load_dataset

# The dataset is gated: accept the access agreement on Hugging Face,
# then authenticate (e.g. `huggingface-cli login`) before loading.
dataset = load_dataset("cais/hle", split="test")

HLE-Rolling

In October 2025, the team released HLE-Rolling — a dynamic, evolving fork of the benchmark that accepts new contributions over time. This ensures HLE remains relevant as models improve.

Understanding the Metrics

Accuracy

The primary metric. Models answer each question, and an automated judge (o3-mini) compares each response against the ground-truth answer. Because answers are closed-form and unambiguous, the judge only has to check equivalence with a single correct answer rather than assess open-ended quality.
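As a toy illustration of closed-form grading, the sketch below scores responses by normalized exact match. The normalization is a hypothetical stand-in: the real benchmark uses an LLM judge (o3-mini) precisely because free-form answers vary in formatting more than simple string rules can handle:

```python
def normalize(answer: str) -> str:
    """Crude normalization: case, surrounding whitespace, trailing period."""
    return " ".join(answer.strip().lower().rstrip(".").split())

def grade(responses: list[str], ground_truth: list[str]) -> float:
    """Fraction of responses that exactly match after normalization."""
    correct = sum(
        normalize(r) == normalize(t) for r, t in zip(responses, ground_truth)
    )
    return correct / len(ground_truth)

# "  Paris." matches "paris"; "blue" matches "Blue"; "4" does not match "5".
print(grade(["  Paris.", "4", "blue"], ["paris", "5", "Blue"]))  # 2 of 3 correct
```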

Calibration Error

Models are prompted to provide both an answer and a confidence score (0–100%). Calibration error measures the gap between stated confidence and actual accuracy.

| Scenario | Confidence | Accuracy | Calibration |
| --- | --- | --- | --- |
| Well-calibrated | 50% | 50% | Good |
| Overconfident | 85% | 10% | Bad (CE: 75+) |
| Current frontier models | 60–90% | 5–45% | Bad (CE: 34–89) |
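The gap between confidence and accuracy can be quantified with a binned calibration error. The sketch below computes the common expected calibration error (ECE); it illustrates the idea but is not necessarily the exact formula used by the HLE grader:

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Binned ECE: size-weighted average of |mean confidence - accuracy|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [
            (c, ok) for c, ok in zip(confidences, correct)
            if lo < c <= hi or (b == 0 and c == 0.0)
        ]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        accuracy = sum(ok for _, ok in in_bin) / len(in_bin)
        ece += len(in_bin) / n * abs(avg_conf - accuracy)
    return ece

# A model that is 90% confident but right only 25% of the time:
print(expected_calibration_error([0.9] * 4, [True, False, False, False]))  # ≈ 0.65
```

This matches the "Overconfident" row above: high stated confidence with low accuracy yields a large calibration error.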

Key insight: Most frontier models are systematically overconfident on HLE — they express high confidence even when wrong. This is strong evidence of confabulation/hallucination. The o3 model family shows the best calibration (CE: 34–39), while older models like GPT-4o exhibit calibration errors of 89.

Why HLE Matters

graph LR
    A["Benchmark<br/>Saturation"] --> B["Cannot distinguish<br/>frontier models"]
    B --> C["HLE fills the gap"]
    C --> D["Informed AI policy<br/>& research"]

    A2["Overconfident<br/>models"] --> B2["Calibration errors<br/>not flagged"]
    B2 --> C
    C --> D2["Better safety<br/>assessments"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Measures what matters — Expert-level academic reasoning, not just pattern matching
  2. Resists saturation — Even the best models score < 50%
  3. Exposes overconfidence — Calibration metrics reveal when models are hallucinating
  4. Informs policy — Provides a common reference point for scientists and policymakers
  5. Anti-contamination — Private held-out set detects overfitting to the public dataset

Video: Humanity’s Last Exam Explained

Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀

Conclusion

Humanity’s Last Exam represents a milestone in AI evaluation:

  • 2,500 expert-crafted questions across 100+ subjects that frontier LLMs still largely cannot solve
  • Built by ~1,000 subject-matter experts from 500+ institutions across 50+ countries
  • Published in Nature — peer-reviewed and validated by the scientific community
  • The best model scores 44% — vast room for improvement remains
  • Calibration errors reveal that models don’t know what they don’t know

As AI capabilities advance, HLE provides a meaningful yardstick for measuring genuine progress — not just incremental improvements on already-saturated benchmarks. When models eventually achieve high accuracy on HLE, it will signal a profound leap in AI’s ability to match expert human knowledge on closed-ended academic questions.

But as the authors note: “HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI.”
